Read My Lips: Towards Use of the Microsoft Kinect as a Visual-Only Automatic Speech Recognizer
Abstract
Consumer devices used in the home are capable of collecting ever more information from users, including audio and video. The Microsoft Kinect is particularly well-designed for tracking user speech and motion. In this paper, we explore the ability of current models of the Kinect to serve as an automatic speech recognizer (ASR). Lip reading is known to be difficult because of the many possible lip motions. Our goals were to quantify lip movement and to observe its correlation with recognized words. Our preliminary results show that word recognition through the audio interface with the Microsoft Speech API can provide upwards of 90% accuracy over a corpus of words, and that the visual acuity of the Kinect is sufficient to capture, at high resolution, a total of 22 data points representing the lip model through the Face Tracking API. Based on these results and those of recent work, we forecast that the Kinect has the ability to act as an ASR and that words can potentially be reconstructed by observing lip movement without the presence of sound. Such an ability for household devices to observe and parse communication presents a new set of privacy challenges within the home.
Similar resources
A binaural microscopic model based on a modulation filter bank for predicting speech intelligibility in normal-hearing listeners
In this study, a binaural microscopic model for the prediction of speech intelligibility based on the modulation filter bank is introduced. So far, the spectral criteria such as the STI and SII or other analytical methods have been used in the binaural models to determine the binaural intelligibility. In the proposed model, unlike all models of binaural intelligibility prediction, an automatic ...
Bilingual corpus for AVASR using multiple sensors and depth information
In this paper we present the Bilingual Audio-Visual Corpus with Depth information (BAVCD). The database contains utterances of connected digits, spoken by 15 subjects in English and 6 subjects in Greek, and collected employing multiple audio-visual sensors. Among them, of particular interest is the use of the Microsoft Kinect device, which is able to capture facial depth images using the struct...
Continuous-speech phone recognition from ultrasound and optical images of the tongue and lips
The article describes a video-only speech recognition system for a “silent speech interface” application, using ultrasound and optical images of the voice organ. A one-hour audiovisual speech corpus was phonetically labeled using an automatic speech alignment procedure and robust visual feature extraction techniques. HMM-based stochastic models were estimated separately on the visual and acoust...
Lip Tracking Towards an Automatic Lip Reading Approach
The current era aims to make interaction between humans and their artificial partners (computers) easier and more reliable. One current task is the use of vocal interaction. Speech recognition may be improved by visual information from the human face. In the literature, interpreting lip shape and movement is referred to as lip reading. Lip reading computing plays a vital role in a...
Audiovisual Speech Recognition with Articulator Positions as Hidden Variables
Speech recognition, by both humans and machines, benefits from visual observation of the face, especially at low signal-to-noise ratios (SNRs). It has often been noticed, however, that the audible and visible correlates of a phoneme may be asynchronous; perhaps for this reason, automatic speech recognition structures that allow asynchrony between the audible phoneme and the visible viseme outpe...